Computer and Modernization, 2013, Vol. 218, Issue (10): 229-232. doi: 10.3969/j.issn.1006-2475.2013.10.056

• Network and Communication •

Web Information Segmentation Method Based on DOM Structure Tree

ZHOU Jian1, TANG Jin1,2, LUO Bin1,2   

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China; 2. Key Lab of Industrial Image Processing & Analysis of Anhui Province, Hefei 230039, China
  • Received: 2013-05-23 Revised: 1900-01-01 Online: 2013-10-26 Published: 2013-10-26

Abstract:

Accurate extraction and segmentation of Web information is important for text mining. This paper proposes and implements a method that extracts the informative content of a Web page while preserving the paragraph segmentation of the original text. The method first builds a DOM structure tree from the page layout tags <table> and <div>. It then uses the nesting relations among these layout tags, as reflected in the tree, to select the content blocks and extract their text. Finally, it segments the body text by handling certain special tags. Experimental results show that the method is easy to implement and efficient, and that it automatically extracts the informative content and segments it accurately.
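
The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how content blocks are chosen or which "special tags" drive segmentation, so the sketch assumes that the content block is the layout block holding the most direct text, and that <p>, <br>, <li>, <tr> and headings mark paragraph boundaries. All names (Block, LayoutTreeBuilder, extract_segments, BREAK_TAGS) are hypothetical.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"table", "div"}   # layout tags named in the abstract
# Assumption: the abstract does not list its "special tags"; these are guesses.
BREAK_TAGS = {"p", "br", "li", "tr", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}  # never part of the visible text


class Block:
    """One node of the layout tree: a <table>, a <div>, or the page root."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.segments = [[]]  # text found directly in this block, split into paragraphs

    def paragraphs(self):
        # Join the pieces of each segment, normalise whitespace, drop empty ones.
        joined = (" ".join(" ".join(parts).split()) for parts in self.segments)
        return [p for p in joined if p]

    def text_length(self):
        return sum(len(p) for p in self.paragraphs())


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree that keeps only the nesting of layout tags."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in LAYOUT_TAGS:
            self.current.segments.append([])  # a nested block interrupts the parent's text
            child = Block(tag, self.current)
            self.current.children.append(child)
            self.current = child
        elif tag in BREAK_TAGS:
            self.current.segments.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in LAYOUT_TAGS:
            # Unmatched closing tags are ignored rather than repaired.
            if self.current.tag == tag and self.current.parent is not None:
                self.current = self.current.parent
        elif tag in BREAK_TAGS:
            self.current.segments.append([])

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.current.segments[-1].append(data)


def iter_blocks(block):
    """Yield a block and all of its descendants, depth first."""
    yield block
    for child in block.children:
        yield from iter_blocks(child)


def extract_segments(html):
    """Return the paragraphs of the block with the most direct text (assumed heuristic)."""
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_blocks(builder.root), key=Block.text_length)
    return best.paragraphs()
```

Calling extract_segments(page_source) on a page whose article sits in one <div> next to short navigation and footer blocks returns the article's paragraphs as a list of strings. A single largest-block rule is the simplest possible stand-in for the nesting-based selection the abstract mentions; a real page with the body spread over several sibling blocks would need a richer criterion.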

Key words: semantic markup, layout label, segmentation, noise
